
perf: reduce HashMap/collection allocation overhead in gateway path#48662

Draft
xinlian12 wants to merge 15 commits into Azure:main from xinlian12:perf/hashmap-collection-allocation

Conversation


@xinlian12 xinlian12 commented Apr 1, 2026

Performance: Reduce HashMap/Collection Allocation Overhead in Gateway Path

Motivation

JFR profiling of the baseline (main) under high-concurrency gateway workloads revealed that HashMap-related allocations (HashMap$Node, HashMap, HashMap$ValueIterator) and HTTP header collections (DefaultHeaders$HeaderEntry, HttpHeader) are responsible for a significant share of total object allocation churn.

Baseline JFR allocation profile (c128 Read HTTP/1, ObjectAllocationSample, 10-min recording):

| Class | % of Total Allocation |
| --- | --- |
| HashMap$Node | 6.9% |
| DefaultHeaders$HeaderEntry | 6.8% |
| HashMap$ValueIterator | 1.3% |
| HttpHeader | 0.9% |
| HashMap | 0.7% |
| HttpHeaders | 0.6% |
| HashMap$Node[] | 0.5% |
| **Total targeted** | **~10.9%** |

Root causes:

  1. HashMap<>() default initial capacity (16) forces 1-2 resize+rehash cycles for typical gateway responses with 20-30 headers, creating throwaway HashMap$Node[] arrays and re-hashed HashMap$Node entries
  2. StoreResponse constructor converts HttpHeaders to Map via HttpUtils.asMap() on every response, allocating a throwaway HashMap$ValueIterator and rebuilding all HashMap$Node entries
  3. HttpHeaders in RxGatewayStoreModel.getHttpRequestHeaders() is undersized, causing internal HashMap resize
  4. Redundant toLowerCase() calls on header keys that are already normalized
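To make root cause 1 concrete, the resize arithmetic can be simulated with a standalone sketch (not SDK code): java.util.HashMap starts at capacity 16 with load factor 0.75 (threshold 12), so the 13th insert triggers the first doubling and the 25th the second.

```java
// Sketch: counts the resize/rehash cycles a default-sized java.util.HashMap
// performs while inserting n entries. Mirrors HashMap's rule that a resize
// happens once size exceeds capacity * loadFactor.
public final class HashMapResizeMath {
    static final float LOAD_FACTOR = 0.75f;  // HashMap's default load factor
    static final int DEFAULT_CAPACITY = 16;  // HashMap's default initial capacity

    /** Resize cycles incurred when putting n entries into new HashMap<>(). */
    public static int resizeCount(int n) {
        int capacity = DEFAULT_CAPACITY;
        int threshold = (int) (capacity * LOAD_FACTOR); // 12
        int resizes = 0;
        for (int inserted = 1; inserted <= n; inserted++) {
            if (inserted > threshold) { // table doubles; every node is rehashed
                capacity *= 2;
                threshold = (int) (capacity * LOAD_FACTOR);
                resizes++;
            }
        }
        return resizes;
    }

    public static void main(String[] args) {
        System.out.println(resizeCount(20)); // 1 resize for a 20-header response
        System.out.println(resizeCount(30)); // 2 resizes for a 30-header response
    }
}
```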

Changes

  1. Right-sized HashMap initial capacity: HashMap<>(32) instead of HashMap<>() in RxDocumentServiceRequest, and mapCapacityForSize() helper in HttpUtils to avoid rehashing
  2. Eliminate HashMap to HttpHeaders to HashMap round-trip: StoreResponse now accepts HttpHeaders directly, removing intermediate asMap() conversion that created throwaway HashMap$ValueIterator and HashMap$Node arrays
  3. Pre-sized HttpHeaders in RxGatewayStoreModel: sized to defaultHeaders.size() + headers.size() to avoid internal HashMap resize
  4. Remove redundant toLowerCase() calls: HttpHeaders.set() already normalizes keys; callers no longer double-normalize creating extra String objects
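A minimal sketch of the mapCapacityForSize() helper from change 1, using the n * 4 / 3 + 1 integer formula quoted in the commit messages (the exact SDK signature and class placement may differ):

```java
import java.util.HashMap;
import java.util.Map;

public final class CapacityHelper {
    /** Initial capacity whose 0.75-load-factor threshold covers n entries,
     *  so inserting n entries never triggers a resize. */
    public static int mapCapacityForSize(int n) {
        return n * 4 / 3 + 1; // integer inverse of 0.75, +1 to absorb rounding
    }

    public static void main(String[] args) {
        int n = 24; // e.g. a typical gateway response header count
        Map<String, String> headers = new HashMap<>(mapCapacityForSize(n));
        for (int i = 0; i < n; i++) {
            headers.put("x-header-" + i, "value-" + i); // no resize along the way
        }
        System.out.println(mapCapacityForSize(24)); // prints 33
    }
}
```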

Benchmark Results

Test matrix: 1 tenant x {c1, c8, c16, c32, c128} concurrency x {Read, Write} x {HTTP/1, HTTP/2} x 3 rounds each, GATEWAY mode, 10 min/run.

Throughput Summary (ops/s, 3-round average ± stddev)

| Config | Conc | main (baseline) | hashmap-alloc (PR) | Delta |
| --- | --- | --- | --- | --- |
| Read/HTTP1 | c1 | 433 ±41 | 460 ±37 | +6.1% |
| Read/HTTP1 | c8 | 4,897 ±135 | 4,971 ±108 | +1.5% |
| Read/HTTP1 | c16 | 7,639 ±680 | 7,305 ±171 | -4.4%* |
| Read/HTTP1 | c32 | 21,297 ±1,476 | 19,913 ±329 | -6.5%* |
| Read/HTTP1 | c128 | 54,528 ±1,555 | 54,223 ±1,462 | -0.6% |
| Read/HTTP2 | c1 | 414 ±36 | 408 ±39 | -1.4% |
| Read/HTTP2 | c8 | 4,866 ±453 | 4,659 ±67 | -4.3%* |
| Read/HTTP2 | c16 | 6,974 ±156 | 6,884 ±150 | -1.3% |
| Read/HTTP2 | c32 | 19,553 ±1,724 | 18,488 ±144 | -5.4%* |
| Read/HTTP2 | c128 | 47,133 ±393 | 48,856 ±650 | +3.7% |
| Write/HTTP1 | c1 | 179 ±1 | 170 ±1 | -5.2% |
| Write/HTTP1 | c8 | 1,676 ±9 | 1,726 ±41 | +3.0% |
| Write/HTTP1 | c16 | 3,138 ±88 | 3,131 ±97 | -0.2% |
| Write/HTTP1 | c32 | 7,302 ±178 | 7,301 ±234 | -0.0% |
| Write/HTTP1 | c128 | 13,628 ±15 | 13,643 ±34 | +0.1% |
| Write/HTTP2 | c1 | 160 ±0 | 159 ±2 | -0.2% |
| Write/HTTP2 | c8 | 1,652 ±47 | 1,619 ±2 | -2.0% |
| Write/HTTP2 | c16 | 3,055 ±68 | 2,969 ±94 | -2.8% |
| Write/HTTP2 | c32 | 7,031 ±228 | 7,024 ±232 | -0.1% |
| Write/HTTP2 | c128 | 13,648 ±24 | 13,664 ±5 | +0.1% |

Variance Analysis

The apparent -4% to -6% deltas at mid-concurrency (c16/c32) are not SDK regressions; they are caused by server-side transit time variability between rounds.

A dedicated 6-round reproducibility study (1t-c32-ReadThroughput-http1) with request-level metrics enabled confirms this:

| Metric | main (6 rounds) | hashmap-alloc (6 rounds) |
| --- | --- | --- |
| Avg throughput | 21,346 ops/s | 19,793 ops/s |
| Stddev | 1,541 | 352 |
| CV (coefficient of variation) | 7.2% | 1.8% |

The request-level breakdown shows the variance lives entirely in transitTime (server round-trip), not in SDK-side processing:

| Round | main ops/s | main transitTime (ms) | hashmap ops/s | hashmap transitTime (ms) |
| --- | --- | --- | --- | --- |
| r1 | 20,021 | 1.346 | 20,136 | 1.343 |
| r2 | 19,417 | 1.406 | 20,226 | 1.350 |
| r3 | 22,905 | 1.141 | 19,476 | 1.404 |
| r4 | 22,952 | 1.141 | 20,045 | 1.355 |
| r5 | 20,020 | 1.353 | 19,538 | 1.396 |
| r6 | 22,763 | 1.144 | 19,335 | 1.411 |
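The quoted CV figures can be reproduced from the per-round throughputs in this table (using population standard deviation, which matches the stated 1,541 and 352):

```java
// Computes the coefficient of variation (population stddev / mean) from
// the per-round throughput numbers reported in the table above.
public final class CvCheck {
    public static double cv(double[] samples) {
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= samples.length;
        double var = 0;
        for (double s : samples) var += (s - mean) * (s - mean);
        var /= samples.length;        // population variance
        return Math.sqrt(var) / mean; // coefficient of variation
    }

    public static void main(String[] args) {
        double[] mainRounds    = {20021, 19417, 22905, 22952, 20020, 22763};
        double[] hashmapRounds = {20136, 20226, 19476, 20045, 19538, 19335};
        System.out.printf("main CV    = %.1f%%%n", 100 * cv(mainRounds));    // ~7.2%
        System.out.printf("hashmap CV = %.1f%%%n", 100 * cv(hashmapRounds)); // ~1.8%
    }
}
```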

hashmap-alloc has 4x lower CV (1.8% vs 7.2%).

Root cause of transit time variance (confirmed via TCP socket analysis): The bimodal pattern is caused by Azure Traffic Manager (ATM) routing the regional Cosmos endpoint (benchmark-cosmos-lx1-westus2.documents.azure.com) to different frontend nodes on each JVM restart. This was confirmed by running `ss -i -t -n state established '( dport = :443 )'` mid-run across 6 rounds: all 32 connections land on the same IP within each round, and that IP alternates perfectly with throughput: 20.9.156.133 (fe2, slow, ~19.3K ops/s) vs 20.42.170.147 (fe7, fast, ~20.2K ops/s). The ATM CNAME record has TTL=20s, so each JVM restart resolves to whichever frontend ATM selects at that moment. This is infrastructure variability unrelated to SDK code.

GC Comparison (c128 Read HTTP/1, r1)

| Metric | main | hashmap-alloc |
| --- | --- | --- |
| GC pause count | 817 | 813 |
| Mean pause | 2.36 ms | 2.38 ms |
| P99 pause | 7.40 ms | 7.66 ms |
| Total pause time | 1,929 ms | 1,935 ms |

GC behavior is identical between branches. At single-tenant scale with an 8 GB heap, the allocation reduction does not materially change GC frequency or pause time. The benefit is reduced unnecessary work (fewer resize/rehash cycles, fewer throwaway iterators) which would compound at higher tenant density.

JFR Allocation Comparison All Configs

ObjectAllocationSample comparison for aggregate allocation share of all 9 targeted classes.

Note on HashMap$ValueIterator: This PR eliminates the response-side HttpUtils.asMap() iterator. A separate HashMap$ValueIterator still exists on the request-sending side (ReactorNettyClient.bodySendDelegate); this is expected and not targeted by this PR.

| Config | main targeted % | hashmap-alloc targeted % | Delta (pp) |
| --- | --- | --- | --- |
| c1-Read/http1 | 11.7% | 14.4% | +2.7 |
| c8-Read/http1 | 22.7% | 10.6% | -12.1 |
| c16-Read/http1 | 9.2% | 14.1% | +4.9 |
| c32-Read/http1 | 11.2% | 12.8% | +1.7 |
| c128-Read/http1 | 20.4% | 17.4% | -3.0 |
| c1-Read/http2 | 11.4% | 10.4% | -1.1 |
| c8-Read/http2 | 11.9% | 7.1% | -4.8 |
| c16-Read/http2 | 9.1% | 9.0% | -0.1 |
| c32-Read/http2 | 14.6% | 10.5% | -4.1 |
| c128-Read/http2 | 16.9% | 15.7% | -1.1 |
| c1-Write/http1 | 11.2% | 3.5% | -7.7 |
| c8-Write/http1 | 15.2% | 20.3% | +5.0 |
| c16-Write/http1 | 8.0% | 17.2% | +9.2 |
| c32-Write/http1 | 17.7% | 22.2% | +4.5 |
| c128-Write/http1 | 16.5% | 10.1% | -6.5 |
| c1-Write/http2 | 9.1% | 6.2% | -2.9 |
| c8-Write/http2 | 15.7% | 18.7% | +2.9 |
| c16-Write/http2 | 16.0% | 12.3% | -3.7 |
| c32-Write/http2 | 18.0% | 11.8% | -6.2 |
| c128-Write/http2 | 8.5% | 13.1% | +4.6 |

Note on JFR sampling noise: Individual per-config percentages can swing +/-5pp between runs. The consistently observable patterns are:

  1. HashMap$ValueIterator is eliminated in most configs (the asMap() round-trip is removed)
  2. At high concurrency (c128), targeted allocation share drops consistently (Read/HTTP1: 20.4% → 17.4%, Write/HTTP1: 16.5% → 10.1%)

Detailed breakdown for c128 Read HTTP/1 (highest pressure, most stable JFR signal):

| Class | main | hashmap-alloc | Change |
| --- | --- | --- | --- |
| HashMap$Node | 6.9% | 5.2% | -1.7pp |
| HashMap$ValueIterator | 1.3% | 0.0% | eliminated |
| DefaultHeaders$HeaderEntry | 6.8% | 4.4% | -2.4pp |
| DefaultHeadersImpl | 1.3% | 0.04% | -1.3pp |
| HttpHeader | 0.9% | 0.4% | -0.5pp |

(Charts: JFR allocation comparison, summary chart, summary throughput.)

30-Tenant Benchmark Results

Test matrix: 30 tenants x {c3, c5} concurrency-per-tenant x {Read, Write} x {HTTP/1, HTTP/2}, GATEWAY mode, single cycle (~10 min steady-state). Metrics reported from tenant 0 (representative).

| Config | main ops/s | PR ops/s | Delta | main p95 | PR p95 | main p99 | PR p99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 30t-c3 Read/HTTP1 | 1,211 | 1,182 | -2.4% | 3.11ms | 3.15ms | 5.70ms | 5.72ms |
| 30t-c3 Read/HTTP2 | 1,092 | 1,102 | +0.9% | 3.42ms | 3.54ms | 6.49ms | 6.70ms |
| 30t-c3 Write/HTTP1 | 558 | 559 | +0.2% | 6.18ms | 6.18ms | 7.48ms | 7.99ms |
| 30t-c3 Write/HTTP2 | 534 | 504 | -5.6% | 6.24ms | 7.00ms | 8.35ms | 9.96ms |
| 30t-c5 Read/HTTP1 | 1,586 | 1,587 | +0.1% | 4.13ms | 4.15ms | 6.41ms | 6.46ms |
| 30t-c5 Read/HTTP2 | 1,394 | 1,353 | -2.9% | 4.98ms | 5.24ms | 7.79ms | 7.97ms |
| 30t-c5 Write/HTTP1 | 894 | 937 | +4.8% | 6.69ms | 6.16ms | 9.29ms | 7.72ms |
| 30t-c5 Write/HTTP2 | 853 | 842 | -1.3% | 7.11ms | 7.07ms | 10.52ms | 10.09ms |

7 of 8 configs are within 3% throughput. The 30t-c3-Write/HTTP2 config shows -5.6%, within ATM-induced between-run variance (runs were separated by ~4 hours, so ATM may route to a different frontend). The 30t-c5-Write/HTTP1 config shows +4.8% throughput with p99 reduced from 9.29ms to 7.72ms (-17%).

Conclusion

  • Throughput: neutral overall (within ~3% across nearly all configs), consistent with measurement noise
  • Variance: apparent regressions at c16/c32 are server-side ATM routing variability, confirmed via TCP socket inspection (`ss -i`); hashmap-alloc has 4x lower throughput CV (1.8% vs 7.2%)
  • GC: identical (817 vs 813 pauses, same mean/p99)
  • Allocation efficiency: HashMap$ValueIterator eliminated; HashMap$Node -23%, DefaultHeaders$HeaderEntry -35% at c128
  • 30-tenant: neutral across 8 configs (7/8 within 3%); 30t-c5-Write/HTTP1 shows +4.8% with -17% p99
  • The changes remove unnecessary allocation overhead without regression. The benefit compounds at higher tenant density where allocation pressure and GC become bottlenecks.

Eliminate per-response intermediate HashMap allocation by adding a new
StoreResponse constructor that accepts HttpHeaders directly. Header names
and values are populated into String[] arrays without materializing an
intermediate Map. The JsonNodeStorePayload is updated to accept header
arrays and only builds a Map lazily on error paths (extremely rare).
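An illustrative sketch of the pattern this commit describes, using a plain Map as a stand-in for HttpHeaders (the class and field names here are hypothetical, not the SDK's actual StoreResponse):

```java
// Sketch: copy header names/values straight into parallel String[] arrays,
// skipping the intermediate HashMap, and build a Map only lazily (the PR
// reserves that for rare error paths).
import java.util.LinkedHashMap;
import java.util.Map;

public final class HeaderArraysSketch {
    final String[] names;
    final String[] values;

    HeaderArraysSketch(Map<String, String> httpHeaders) { // stand-in for HttpHeaders
        names = new String[httpHeaders.size()];
        values = new String[httpHeaders.size()];
        int i = 0;
        for (Map.Entry<String, String> e : httpHeaders.entrySet()) {
            names[i] = e.getKey();   // already lowercase per HttpHeaders.set()
            values[i] = e.getValue();
            i++;
        }
    }

    /** Lazily materialize a Map only when actually needed. */
    Map<String, String> toMap() {
        Map<String, String> map = new LinkedHashMap<>(names.length * 4 / 3 + 1);
        for (int i = 0; i < names.length; i++) {
            map.put(names[i], values[i]);
        }
        return map;
    }
}
```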

Pre-size HashMaps throughout the hot path to avoid resize/rehash:
- HttpHeaders request construction: sized to defaultHeaders + request headers
- StoreResponse.replicaStatusList: pre-sized to 4
- StoreResponse.withRemappedStatusCode: pre-sized to header count
- RxDocumentServiceRequest fallback maps: pre-sized to 32

Fix HttpUtils.asMap() double-allocation by iterating HttpHeaders directly
instead of calling toMap() which creates an intermediate HashMap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the Cosmos label Apr 1, 2026
Annie Liang and others added 2 commits April 1, 2026 09:53
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```diff
@@ -51,14 +52,14 @@ public static String urlDecode(String url) {

 public static Map<String, String> asMap(HttpHeaders headers) {
     if (headers == null) {
-        return new HashMap<>();
+        return new HashMap<>(4);
     }
     HashMap<String, String> map = new HashMap<>(headers.size());
```
Member
You also should make this instantiation

Suggested change:

```diff
-HashMap<String, String> map = new HashMap<>(headers.size());
+HashMap<String, String> map = new HashMap<>(((int) headers.size() / 0.75F) + 1);
```

As internally, HashMap will resize once it hits a capacity factor of 0.75. Meaning this conversion has a map resize happening.

Member Author
good catch, yea will change, thanks ~~

Member
Depending on how many locations start doing this, may want to add in a helper method for this.

xinlian12 and others added 3 commits April 4, 2026 22:02
…ve null-guard inconsistency

- Fix HashMap<>(4) to HashMap<>(6) for replicaStatusList to avoid rehash
  at 4 replicas (capacity 4 * 0.75 = threshold 3, resizes on 4th insert)
- Refactor JsonNodeStorePayload: extract shared parseJson() method with
  Supplier<Map<String,String>> to eliminate duplicated error-handling logic
- Remove misleading null ternary in getHttpRequestHeaders() since
  getHeaders() always returns non-null (fallback HashMap<>(32))
- Revert HashMap<>(16) to HashMap<>() in HttpHeaders default constructor
  (16 is already the default capacity, change was no-op noise)
- Add unit tests for StoreResponse HttpHeaders constructor, HttpHeaders
  populateLowerCaseHeaders, and JsonNodeStorePayload array-header constructor
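The capacity arithmetic behind the HashMap<>(4) to HashMap<>(6) fix above can be checked with a sketch of HashMap's power-of-two rounding and 0.75 threshold rules (mirrors java.util.HashMap behavior; not SDK code):

```java
// Sketch of java.util.HashMap's capacity rounding and resize threshold.
public final class ReplicaMapCapacity {
    /** Round a requested capacity up to the next power of two (HashMap behavior). */
    static int tableSizeFor(int cap) {
        if (cap <= 1) return 1;
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return n + 1;
    }

    /** Max entries held before the first resize, given a constructor argument. */
    static int entriesBeforeResize(int requestedCapacity) {
        return (int) (tableSizeFor(requestedCapacity) * 0.75f);
    }

    public static void main(String[] args) {
        System.out.println(entriesBeforeResize(4)); // 3 -> resizes on the 4th replica
        System.out.println(entriesBeforeResize(6)); // 6 -> holds 4 replicas, no resize
    }
}
```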

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix HashMap<>(headers.size()) in HttpUtils.asMap() to account for the
0.75 load factor, avoiding resize when all headers are inserted.

Extract mapCapacityForSize(int) helper in HttpUtils to consolidate the
capacity calculation (n * 4 / 3 + 1) used across HttpUtils.asMap(),
StoreResponse.withRemappedStatusCode(), and JsonNodeStorePayload.buildHeaderMap().

Addresses review feedback from alzimmermsft.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ize to HttpHeaders, clarify lowercase key guarantee

- Add comments explaining why HashMap<>(32) is used as fallback in
  RxDocumentServiceRequest: capacity 32 gives threshold 24, covering
  typical 15-20 request headers without resize.
- Apply HttpUtils.mapCapacityForSize() in RxGatewayStoreModel.getHttpRequestHeaders()
  to account for 0.75 load factor when constructing HttpHeaders.
- Make mapCapacityForSize() public so it can be used from other packages.
- Document in populateLowerCaseHeaders() Javadoc that keys are guaranteed
  lowercase because HttpHeaders.set() stores them via toLowerCase(Locale.ROOT).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xinlian12 and others added 8 commits April 6, 2026 14:14
Extract duplicate contentStream/payload handling from both StoreResponse
constructors into a shared parseResponsePayload() static helper method.
Both constructors now use the array-based JsonNodeStorePayload constructor,
eliminating code duplication while preserving identical behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove both unescape(Set<Entry>) and unescape(Map) overloads from
  HttpUtils as they are no longer needed
- Update ResponseUtils to use the HttpHeaders-based StoreResponse
  constructor (same optimization as RxGatewayStoreModel)
- Remove unescape test from HttpUtilsTest, keep asMap() coverage
- Clean up unused imports (AbstractMap, ArrayList, List, Set)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace previous benchmark charts with comprehensive V3 analysis:
- 20 configs (c1/c8/c16/c32/c128 x Read/Write x HTTP1/HTTP2)
- 3 rounds each, 10 min/run, GATEWAY mode
- Timeline charts with throughput and P99 latency
- JFR allocation breakdown comparison
- Detailed per-round analysis of outlier patterns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JFR ObjectAllocationSample weight = estimated cumulative bytes allocated
over the recording (10 min), not heap residency. Heap was 8 GB committed.
The ~271 GB 'targeted' figure is allocation throughput (~4 GB/s allocation
rate); most objects are immediately GC'd.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace confusing cumulative GB with allocation share %
- Add GC comparison table (817 vs 813 pauses - identical)
- Frame as code efficiency improvement, not GC impact

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…arison

- Remove timeline charts (not needed for review)
- Add variance analysis using request-level metrics (transitTime)
  showing variance is server-side, not SDK-related
- Add JFR allocation comparison for all 20 configs
- Keep summary bar chart and c128 JFR chart

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The c16 PR run showing HashMap$ValueIterator is from the request-sending
path (ReactorNettyClient.bodySendDelegate iterating request headers),
NOT the response-side asMap() iterator we eliminated. Added clarifying
note and removed the per-config ValueIterator column (too noisy).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>